Downloading alphafold2_ptm weights to .: 100%|██████████| 3.47G/3.47G [02:01<00:00, 30.6MB/s]
If we know a protein’s sequence, what 3D structure will it fold into? This protein folding problem has been a longtime dream of biophysicists, biologists, and protein engineers. Solving holds the key to many new discoveries and innovations for research and drug design!
Enter: Deep Learning!
Yet, although we understood much of the individual physical concepts behind protein folding, there was no recipe of equations that could reliably predict what a protein would fold into.
As examples of 3D experimental protein structures increased, the dataset eventually became large enough to train neural networks to predict how proteins folded.
One of the most famous of these is AlphaFold! It kickstarted a revolution in protein folding by drastically outperforming all competitors in CASP, the yearly protein folding competition.
Question: How do you think AF1 and AF2 affected the median percent accuracy in the 2018 and 2020 CASP benchmarks
Hint: Previous scores were around 30-40 Global Distance Score (out of 100)
Code
# @markdown Take a guess!AF1 =70# @param {type:"slider", min:0, max:100, step:10}AF2 =50# @param {type:"slider", min:0, max:100, step:10}year = [2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020]casp_scores = [35, 41, 34, 30, 31, 41, 57, 88]# make bar chartfig = plt.figure(figsize=(10, 5))plt.bar(year, casp_scores)plt.xlabel('Year')plt.ylabel('CASP Score')# add prediction as scatterplt.scatter(2018, AF1, label='AF1', color='purple')plt.scatter(2020, AF2, label='AF2', color='magenta')plt.text(2018, AF1, f'{AF1}', ha='right')plt.text(2020, AF2, f'{AF2}', ha='right')plt.title('CASP Median Free Modeling Accuracy')# Set y axis to 0-100plt.ylim(0, 100)plt.show()
AlphaFold as a Magic Box
For it’s input, AlphaFold takes the sequence and several optional other pieces of information:
MSAs – multiple sequence alignments, genetically similar amino acid sequences inferred from a database.
Pair Features – a pairwise feature representation of the possible residue interactions.
Templates – template proteins that may have similar structure to the unknown protein. These are combined with pair features before thrown into the model’s main modules.
Briefly put, these three inputs are fed into the model, which then spits out a protein and some metrics on how plausible AF thinks the structure is. For more information, see subsection AF Architecture: How Does AF Work?
Let’s get right into it! We have a protein sequence. How do we find it’s structure with AlphaFold?
Here, we will use the ColabFold implementation of AF2 (Mirdita et al. 2022). Usually, we would need to provide MSA alignments, but ColabFold automatically does that for us using MMseqs2. All we have to do is run the cells below, and provide it our sequence (and optionally, templates), and boom, it folds the protein!
Note: When designing de novo proteins, it is recommended to turn MSA off because de novo proteins usually do not have high structural similarities to natural proteins.
Adapted from. For more advanced uses, please visit the ColabFold Github page to find notebooks with more options and parameters. This notebook is mainly intended for educational purposes!
Part 1: Folding a sequence
Code
#@markdown Input your favorite protein sequence, then RUN! (Note: Make sure you are connected to a GPU instance. Should take around 2-5 minutes)from google.colab import filesimport osimport reimport hashlibimport randomfrom sys import version_infopython_version =f"{version_info.major}.{version_info.minor}"def add_hash(x,y):return x+"_"+hashlib.sha1(y.encode()).hexdigest()[:5]query_sequence ='MTWVRQAPGKGPEWVSGINPDGSSTYYADSVKGRFTISRDNAKNTLYLQMNSLKSEDTALYKCATGAAPRIPTTLRGQGTQVTVSSHHHHHH'#@param {type:"string"}jobname ='learning_af'num_relax =0template_mode ="none"use_amber = num_relax >0# remove whitespacesquery_sequence ="".join(query_sequence.split())basejobname ="".join(jobname.split())basejobname = re.sub(r'\W+', '', basejobname)jobname = add_hash(basejobname, query_sequence)# check if directory with jobname existsdef check(folder):if os.path.exists(folder):returnFalseelse:returnTrueifnot check(jobname): n =0whilenot check(f"{jobname}_{n}"): n +=1 jobname =f"{jobname}_{n}"# make directory to save resultsos.makedirs(jobname, exist_ok=True)# save queriesqueries_path = os.path.join(jobname, f"{jobname}.csv")withopen(queries_path, "w") as text_file: text_file.write(f"id,sequence\n{jobname},{query_sequence}")if template_mode =="pdb100": use_templates =True custom_template_path =Noneelif template_mode =="custom": custom_template_path = os.path.join(jobname,f"template") os.makedirs(custom_template_path, exist_ok=True) uploaded = files.upload() use_templates =Truefor fn in uploaded.keys(): os.rename(fn,os.path.join(custom_template_path,fn))else: custom_template_path =None use_templates =Falseprint("jobname",jobname)print("sequence",query_sequence)print("length",len(query_sequence.replace(":","")))import osmsa_mode ="mmseqs2_uniref_env"pair_mode ="unpaired_paired"# decide which a3m to useif"mmseqs2"in msa_mode: a3m_file = os.path.join(jobname,f"{jobname}.a3m")elif msa_mode =="custom": a3m_file = os.path.join(jobname,f"{jobname}.custom.a3m")ifnot os.path.isfile(a3m_file): custom_msa_dict = files.upload() custom_msa =list(custom_msa_dict.keys())[0] header =0import fileinputfor line in fileinput.FileInput(custom_msa,inplace=1):if line.startswith(">"): header = header +1ifnot line.rstrip():continueif line.startswith(">") ==Falseand header ==1: query_sequence = line.rstrip()print(line, end='') os.rename(custom_msa, a3m_file) queries_path=a3m_fileprint(f"moving {custom_msa} to {a3m_file}")else: a3m_file = os.path.join(jobname,f"{jobname}.single_sequence.a3m")withopen(a3m_file, "w") as text_file: text_file.write(">1\n%s"% query_sequence)model_type ="auto"num_recycles ="3"recycle_early_stop_tolerance ="auto"relax_max_iterations =200pairing_strategy ="greedy"max_msa ="auto"num_seeds =1use_dropout =Falsenum_recycles =Noneif num_recycles =="auto"elseint(num_recycles)recycle_early_stop_tolerance =Noneif recycle_early_stop_tolerance =="auto"elsefloat(recycle_early_stop_tolerance)if max_msa =="auto": max_msa =Nonesave_all =Falsesave_recycles =Falsesave_to_google_drive =Falsedpi =200if save_to_google_drive:from pydrive2.drive import GoogleDrivefrom pydrive2.auth import GoogleAuthfrom google.colab import authfrom oauth2client.client import GoogleCredentials auth.authenticate_user() gauth = GoogleAuth() gauth.credentials = GoogleCredentials.get_application_default() drive = GoogleDrive(gauth)print("You are logged into Google Drive and are good to go!")display_images =Trueimport sysimport warningswarnings.simplefilter(action='ignore', category=FutureWarning)from Bio import BiopythonDeprecationWarningwarnings.simplefilter(action='ignore', category=BiopythonDeprecationWarning)from pathlib import Pathfrom colabfold.download import download_alphafold_params, default_data_dirfrom colabfold.utils import setup_loggingfrom colabfold.batch import get_queries, run, set_model_typefrom colabfold.plot import plot_msa_v2import osimport numpy as nptry: K80_chk = os.popen('nvidia-smi | grep "Tesla K80" | wc -l').read()except: K80_chk ="0"passif"1"in K80_chk:print("WARNING: found GPU Tesla K80: limited to total length < 1000")if"TF_FORCE_UNIFIED_MEMORY"in os.environ:del os.environ["TF_FORCE_UNIFIED_MEMORY"]if"XLA_PYTHON_CLIENT_MEM_FRACTION"in os.environ:del os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]from colabfold.colabfold import plot_proteinfrom pathlib import Pathimport matplotlib.pyplot as plt# For some reason we need that to get pdbfixer to importif use_amber andf"/usr/local/lib/python{python_version}/site-packages/"notin sys.path: sys.path.insert(0, f"/usr/local/lib/python{python_version}/site-packages/")display_images =Falsedef input_features_callback(input_features):if display_images: plot_msa_v2(input_features) plt.show() plt.close()def prediction_callback(protein_obj, length, prediction_result, input_features, mode): model_name, relaxed = modeifnot relaxed:if display_images: fig = plot_protein(protein_obj, Ls=length, dpi=150) plt.show() plt.close()result_dir = jobnamelog_filename = os.path.join(jobname,"log.txt")setup_logging(Path(log_filename))queries, is_complex = get_queries(queries_path)model_type = set_model_type(is_complex, model_type)if"multimer"in model_type and max_msa isnotNone: use_cluster_profile =Falseelse: use_cluster_profile =Truedownload_alphafold_params(model_type, Path("."))results = run( queries=queries, result_dir=result_dir, use_templates=use_templates, custom_template_path=custom_template_path, num_relax=num_relax, msa_mode=msa_mode, model_type=model_type, num_models=2, num_recycles=2, relax_max_iterations=relax_max_iterations, recycle_early_stop_tolerance=recycle_early_stop_tolerance, num_seeds=num_seeds, use_dropout=use_dropout, model_order=[1,2], # edited is_complex=is_complex, data_dir=Path("."), keep_existing_results=False, rank_by="auto", pair_mode=pair_mode, pairing_strategy=pairing_strategy, stop_at_score=float(100), prediction_callback=prediction_callback, dpi=dpi, zip_results=False, save_all=save_all, max_msa=max_msa, use_cluster_profile=use_cluster_profile, input_features_callback=input_features_callback, save_recycles=save_recycles, user_agent="colabfold/google-colab-main",)results_zip =f"{jobname}.result.zip"os.system(f"zip -r {results_zip}{jobname}")
jobname learning_af_e64d1
sequence MTWVRQAPGKGPEWVSGINPDGSSTYYADSVKGRFTISRDNAKNTLYLQMNSLKSEDTALYKCATGAAPRIPTTLRGQGTQVTVSSHHHHHH
length 92
2024-10-09 17:38:47,646 Running on GPU
2024-10-09 17:38:48,066 Found 5 citations for tools or databases
2024-10-09 17:38:48,067 Query 1/1: learning_af_e64d1 (length 92)
3Dmol.js failed to load for some reason. Please check your browser console for error messages.
Part 2: How do we interpret the results?
If you were to download the outputs (packaged conveniently on the actual ColabFold notebook, or less conveniently in the files for this notebook), you will get several files of interest:
Several “ranked” 3D pdb models
An image detailing the “coverage”
An image detailing the “pLDDT”
An image detailing the “PAE”
Let’s look into what each means!
MSA Coverage - How much similar sequences are there in nature?
“Coverage” is pretty straightforward- when AF looks into its database of sequences, the coverage shows how many matches it was able to assign for parts of the input sequence.
Code
#@markdown Run to show the MSA coverage of your protein - Is there a lot or a little coverage? (hint: more colors = more coverage)from IPython.display import display, HTML, Imageimport base64from html import escape# see: https://stackoverflow.com/a/53688522def image_to_data_url(filename): ext = filename.split('.')[-1] prefix =f'data:image/{ext};base64,'withopen(filename, 'rb') as f: img = f.read()return prefix + base64.b64encode(img).decode('utf-8')cov = image_to_data_url(os.path.join(jobname,f"{jobname}{jobname_prefix}_coverage.png"))display(Image(url=cov, width=800, height=500))
Question: Which protein (left or right) do you think has more sequence coverage coverage? What does it mean to have more coverage?
image.png
Code
#@markdown <font color='#e59454'>Question Time! Run to Show Questionfrom IPython.display import HTML, displayfrom ipywidgets import widgetsdef create_answer_box(question, correct_answer):def display_answer(button): user_answer = text_area.value.strip() result.value =f"<pre class='answer-text'><b>{correct_answer}</b></pre>"# Create widgets text_area = widgets.Textarea( placeholder='Enter your answer here...', layout={'width': '95%', 'height': '100px'} ) submit_button = widgets.Button(description="Submit") result = widgets.HTML(value="")# Set up the layout vbox = widgets.VBox([text_area, submit_button, result])# Add button click event submit_button.on_click(display_answer)# Display the widget display(HTML(f"<b>{question}</b>")) display(vbox)# Add custom CSS for text wrapping display(HTML(""" <style> .widget-html-content { width: 95%; white-space: normal; word-wrap: break-word; } .answer-text { width: 90%; color: green; line-height: 1.5; white-space: pre-wrap; word-wrap: break-word; font-family: inherit; font-size: inherit; margin: 0; padding: 0; } </style> """))# Example usagecreate_answer_box('Which one has more coverage?', 'Trick question!!! The one one the left has lots of coverage, as it is a highly documented natural protein (Snase), with many similar sequences in the MSA database. On the other hand, the one on the right is a complex folded without MSAs. We want to do this when folding de novo proteins- as they are not as evolutionarily related/structrually related to similar sequences found in nature, and would mess with the inference.')
Which one has more coverage?
pLDDT - predicted local distance difference test score
pLDDT can be interpreted as the per-residue confidence that AF has in its prediction. A higher pLDDT means the prediction is more accurate.
From Deepmind’s website,
>90 = High accuracy
70-90 = Modeled well
50-70 = Low confidence
<50 = Probably disordered region
The overall pLDDT is used to rank the outputs (and the structure we visualized is colored by pLDDT – blue is higher and red is lower)
Code
#@markdown Run to display your pLDDT score - Is it high confidence? Where might AF2 be less confident?from IPython.display import display, HTML, Imageimport base64from html import escape# see: https://stackoverflow.com/a/53688522def image_to_data_url(filename): ext = filename.split('.')[-1] prefix =f'data:image/{ext};base64,'withopen(filename, 'rb') as f: img = f.read()return prefix + base64.b64encode(img).decode('utf-8')cov = image_to_data_url(os.path.join(jobname,f"{jobname}{jobname_prefix}_plddt.png"))display(Image(url=cov, width=800, height=500))
Question: Between the left and right folded proteins, which one is more likely to mirror experimental results?
image.png
Code
#@markdown <font color='#e59454'>Question Time! Run to Show Questionfrom IPython.display import HTML, displayfrom ipywidgets import widgetsdef create_answer_box(correct_answer):def display_answer(button): user_answer = text_area.value.strip() result.value =f"<pre class='answer-text'><b>{correct_answer}</b></pre>"# Create widgets text_area = widgets.Textarea( placeholder='Enter your answer here...', layout={'width': '95%', 'height': '100px'} ) submit_button = widgets.Button(description="Submit") result = widgets.HTML(value="")# Set up the layout vbox = widgets.VBox([text_area, submit_button, result])# Add button click event submit_button.on_click(display_answer)# Display the widget display(HTML("<b>Which one is AF more confident in? Why?</b>")) display(vbox)# Add custom CSS for text wrapping display(HTML(""" <style> .widget-html-content { width: 95%; white-space: normal; word-wrap: break-word; } .answer-text { width: 90%; color: green; line-height: 1.5; white-space: pre-wrap; word-wrap: break-word; font-family: inherit; font-size: inherit; margin: 0; padding: 0; } </style> """))# Example usagecreate_answer_box('Alphafold has more confidence in the right one! This is because the plddt is higher (there are less "spikes")')
Which one is AF more confident in? Why?
PAE - predicted aligned error
Represents AF’s confidence for the interaction of protein domains and residues. Given in a contact map between all residues.
The color at (x, y) corresponds to the expected distance error in residue x’s position, when the prediction and true structure are aligned on residue y.
Lower is better: * Low = well defined positions/orientation (blue) * High = uncertain relative positions (red)
For example: If a nanobody binds well to an antigen, we would expect to see more “blue” in the interaction regions between the two chains (see the question below for an example)
Code
#@markdown Run to view PAE charts - take time to understand what the x and y axes mean!from IPython.display import display, HTML, Imageimport base64from html import escape# see: https://stackoverflow.com/a/53688522def image_to_data_url(filename): ext = filename.split('.')[-1] prefix =f'data:image/{ext};base64,'withopen(filename, 'rb') as f: img = f.read()return prefix + base64.b64encode(img).decode('utf-8')pae =""pae_file = os.path.join(jobname,f"{jobname}{jobname_prefix}_pae.png")if os.path.isfile(pae_file): pae = image_to_data_url(pae_file)display(HTML(f"""<div style="max-width:90%; padding:2em;">{'<!--'if pae ==''else''}<img src="{pae}" class="full" />{'-->'if pae ==''else''}</div>"""))
Question: The following is a PAE diagram between two proteins, A and B. Which of the four quadrants (1,2,3,4) will detail any interactions between the two proteins? Bonus: Is the PAE diagram a perfect mirror across the x-y axis?
image.png
Code
#@markdown <font color='#e59454'>Question Time! Run to Show Questionfrom IPython.display import HTML, displayfrom ipywidgets import widgetsdef create_answer_box(correct_answer):def display_answer(button): user_answer = text_area.value.strip() result.value =f"<pre class='answer-text'><b>{correct_answer}</b></pre>"# Create widgets text_area = widgets.Textarea( placeholder='Enter your answer here...', layout={'width': '95%', 'height': '100px'} ) submit_button = widgets.Button(description="Submit") result = widgets.HTML(value="")# Set up the layout vbox = widgets.VBox([text_area, submit_button, result])# Add button click event submit_button.on_click(display_answer)# Display the widget display(HTML("<b>Which quadrants represent intermolecular interactions? Is the figure mirrored? Why or why not?</b>")) display(vbox)# Add custom CSS for text wrapping display(HTML(""" <style> .widget-html-content { width: 95%; white-space: normal; word-wrap: break-word; } .answer-text { width: 90%; color: green; line-height: 1.5; white-space: pre-wrap; word-wrap: break-word; font-family: inherit; font-size: inherit; margin: 0; padding: 0; } </style> """))# Example usagecreate_answer_box("2 & 3 show weak interactions between the two proteins. However, it should be noted that the PAE diagram is not exactly a perfect mirror. If there is different aligned error based on which residue we align on, the difference that will be reflected in the diagram")
Which quadrants represent intermolecular interactions? Is the figure mirrored? Why or why not?
More Files
Below is a summary of all the files produced after running AlphaFold
image.png
Section 3: AF Architecture– How Does AF Work?
AF utilized many innovations, including multi-track architecture, multiple-sequence alignments, end-to-training, attention, and its own confidence estimator. Don’t worry if you don’t understand these terms, we will go into it!
Below are the three main components of AF2’s architecture that we talked about before:
image.png
Part 1: Representations of our input sequence
Protein sequence information
Protein sequence information fed to AlphaFold 2 are stored as a fasta file. Here’s an example below:
Building a multiple sequence alignment and template structure
The input amino acid sequence is used as a query to several databases (normally UniRef) of protein sequences and constructs a multiple sequence alignment (MSA). An MSA identifies similar (but not identical) sequences that have been identified in living organisms. How does one go from an alignment to a structure? The theory is that residues that coevolve are generally close to each other in the protein’s folded state.
The similar sequences pulled from also have structural information. These templates can be converted into distance matrices to determine the distance between residues
image.png
Code
#@markdown **<font color='#e59454'>Question Time! Run to Show Question**import randomfrom IPython.display import display, HTMLdef create_quiz(question, correct_answer, decoy_answers):# Combine the correct answer with the decoy answers answers = [correct_answer] + decoy_answers# Shuffle the answers random.shuffle(answers)# Generate the HTML code html_code =f""" <div style="font-family: Arial, sans-serif; margin: 30px; max-width: 1000px;"> <h3 style="word-wrap: break-word;">{question}</h3> """# Add each answer as a radio buttonfor i, answer inenumerate(answers): html_code +=f""" <div style="word-wrap: break-word; margin-bottom: 10px;"> <input type="radio" id="option{i}" name="quiz" value="{answer}"> <label for="option{i}" id="label{i}">{answer}</label> </div> """# Add the submit button and feedback section html_code +=f""" <button onclick="checkAnswer()">Submit</button> <p id="feedback" style="font-weight: bold;"></p> </div> <script> function checkAnswer() {{ var radios = document.getElementsByName('quiz'); var correctAnswer = "{correct_answer}"; var feedback = document.getElementById('feedback'); var selectedOption = null; var selectedLabel = null; // Get the selected option for (var i = 0, length = radios.length; i < length; i++) {{ if (radios[i].checked) {{ selectedOption = radios[i].value; selectedLabel = document.getElementById('label' + i); break;}}}} // Provide feedback if (selectedOption) {{ if (selectedOption === correctAnswer) {{ feedback.textContent = "Correct!"; feedback.style.color = "green"; selectedLabel.style.color = "green";}} else {{ feedback.textContent = "Incorrect."; feedback.style.color = "red"; selectedLabel.style.color = "red";}}}} else {{ feedback.textContent = "Please select an answer."; feedback.style.color = "orange";}}}} </script> """ display(HTML(html_code))# Examplequestion ="What is an MSA and why is it important for AF2?"correct_answer ="An MSA is a sequence alignment of protein sequences that identifies similar, but not identical, sequences across different organisms. It is important because it helps determine which parts of the sequence are more likely to mutate and allows the detection of correlations between amino acids, providing crucial evolutionary information for predicting protein structure."decoy_answers = ["An MSA is a matrix of sequence alignments that highlights the most conserved regions in a protein sequence. For AF2, it’s important because it helps in quantifying the stability of different protein folds by analyzing these conserved regions.","An MSA is a method for annotating sequences based on their biochemical properties. It’s essential for AF2 as it helps classify sequences into functional categories, which can then be used to infer protein structure.","An MSA is a multi-sequence alignment used to determine the secondary structure of proteins. It is crucial for AF2 because it provides a visual representation of how protein structures align with one another, aiding in the prediction of tertiary structures."]create_quiz(question, correct_answer, decoy_answers)
What is an MSA and why is it important for AF2?
Here, we also try to identify templates, which are proteins that may have similar structure to that of the input sequence. This is used to construct an initial representation of the structure, also known as the pair representation. By identifying conserved fragments, the model uses this as a guide to construct a structure.
Code
#@markdown **<font color='#e59454'>Question Time! Run to Show Question**import randomfrom IPython.display import display, HTMLquestion ="What is the concept of “pair representation” in AF2?"correct_answer ="Pair representation in AF2 is a model of the likelihood that pairs of amino acids will be in contact with each other. It is an intermediate representation that is refined through the Evoformer module and used to inform the final structure prediction of the protein."decoy_answers = ["Pair representation in AF2 is a method for visualizing the three-dimensional arrangement of amino acid side chains. It provides AF2 with a detailed spatial map of side chains, crucial for understanding the overall protein shape.","Pair representation in AF2 denotes the alignment of protein sequences based on evolutionary relationships. It allows AF2 to identify homologous pairs of sequences and infer structural similarities.","Pair representation in AF2 is a technique for quantifying the hydrophobic interactions between amino acids. It’s used to estimate the internal energy of the protein structure and refine predictions of protein folding."]create_quiz(question, correct_answer, decoy_answers)
What is the concept of “pair representation” in AF2?
Part 2: Evoformer
Here, we take the MSA and templates and pass them through a pair of transformers. Doing so (1) identifies which pieces of information are more informative, (2) iteratively exchanges information between MSA and pair representations, and (3) refines the representations for both the MSA and pair interactions.
Below is a deeper look into the Evoformer (which is composed of an MSA transformer and pair transformer). Left image: The MSA transformer identifies a correlation between the two columns of the MSA, each corresponding to a residue. This information is passed to the pair representation, where subsequently the pair representation identifies another possible interaction. Right image: Information is passed back to the MSA. The MSA transformer receives an input from the pair representation and observes another pair of columns that exhibit a significant correlation.
image.png
Code
#@markdown **<font color='#e59454'>Question Time! Run to Show Question**import randomfrom IPython.display import display, HTMLquestion ="What is the role of the Evoformer in the AF2 architecture?"correct_answer ="The Evoformer is responsible for extracting and refining information from the MSA and templates. It uses a transformer architecture to iteratively exchange information between the MSA and pairwise interactions, improving the model's characterization of protein structure through multiple cycles."decoy_answers = ["The Evoformer functions as a classifier that categorizes amino acid sequences into predefined structural classes. It uses a supervised learning approach to label sequences based on known protein structures.","The Evoformer is a post-processing step that refines the final protein structure by adjusting predicted coordinates to fit known experimental data. It fine-tunes the structure predictions based on empirical measurements.","The Evoformer is a data preprocessing module that normalizes and filters raw protein sequences before they are input into the main prediction model. It ensures that only high-quality sequences are used for structure prediction."]create_quiz(question, correct_answer, decoy_answers)
What is the role of the Evoformer in the AF2 architecture?
The pair of transformers works on the principle of Triangle inequality, where the sum of two sides must be greater than or equal to the third side. Attention is arranged in terms of triangles of residues. Using this theorem, we can determine the likely distance residues have from one another because the distance between three points can never break that theorem
image.png
Part 3: Structure module
The structure module takes a refined MSA representation and refined pair representation, and uses this to construct a 3D model of the structure. The end result is a long list of cartesian coordinates representing the position of each atom of the protein (including side chains). Proteins are represented as a residue gas, where every amino acid is modeled as a triangle with points at the Nitrogen, R group Carbon and the Alpha Carbon. At the beginning of the structure module, all residues are placed at the origin of coordinates. At every step, the model produces a set of matrices that displace/rotate the residues in space. This representation does not reflect any physical or geometric assumptions, and as a result the network has a tendency to generate structural violations, which you can see in this video
image.png
Code
#@markdown **<font color='#e59454'>Question Time! Run to Show Question**import randomfrom IPython.display import display, HTMLquestion ="How does the structure module of AF2 generate the final protein structure?"correct_answer ="The structure module of AlphaFold 2 takes the refined MSA and pair representations and constructs a three-dimensional model of the protein structure. It treats the protein as a 'residue gas,' where amino acids are represented as triangles. The module uses Invariant Point Attention (IPA) to ensure the structure is invariant to translations and rotations."decoy_answers = ["The structure module of AlphaFold 2 utilizes a lattice-based approach to construct the protein model, where amino acids are placed on a fixed grid. It then applies optimization algorithms to minimize structural energy and achieve the final protein structure.","The structure module uses a generative adversarial network (GAN) to create the protein structure. It generates multiple possible structures and selects the best one based on adversarial feedback from a discriminator network.","The structure module generates the final protein structure by combining sequence motifs with a template library of known protein structures. It matches the input sequence to the closest template and adjusts the model accordingly."]create_quiz(question, correct_answer, decoy_answers)
How does the structure module of AF2 generate the final protein structure?
Code
#@markdown **<font color='#e59454'>Question Time! Run to Show Question**import randomfrom IPython.display import display, HTMLquestion ="What is invariant point attention (IPA) and why is it important in AF2?"correct_answer ="IPA is used in the structure module to ensure the model's predictions are invariant to translations and rotations of the protein. This is important because it allows the model to generalize better and requires less data to learn accurate protein structures."decoy_answers = ["IPA is a technique used in AF2 to perform dimensionality reduction on protein data. It simplifies the representation of the protein structure by projecting it onto a lower-dimensional space, which helps in speeding up the computation.","IPA is a preprocessing step that aligns protein sequences with known structural templates. It ensures that the input sequences are appropriately matched with template structures before being processed by the main prediction model.","IPA is used to predict several possible orientations of the given protein, that way, more detailed information can be captured about its structure as it undergoes iterative refinement."]create_quiz(question, correct_answer, decoy_answers)
What is invariant point attention (IPA) and why is it important in AF2?
Code
#@markdown **<font color='#e59454'>Question Time! Run to Show Question**import randomfrom IPython.display import display, HTMLquestion ="What is the Frame Aligned Point Error (FAPE) and how is it used in AF2?"correct_answer ="FAPE is a loss function that measures the accuracy of predicted atomic positions in the protein structure. It is similar to RMSD, but includes additional properties to prevent creating proteins with incorrect chirality"decoy_answers = ["FAPE is an evaluation metric that measures the similarity between predicted protein structures and a library of experimental structures. It helps assess the accuracy of predictions by comparing them to known, experimentally determined structures.","FAPE is a regularization term added to the loss function to penalize large deviations in predicted bond angles. It helps ensure that the model predicts more physically realistic protein structures by constraining bond angles.","FAPE is a normalization technique used to adjust the scale of protein structures. It ensures that all predicted structures are on the same scale, which helps in comparing predictions across different models."]create_quiz(question, correct_answer, decoy_answers)
What is the Frame Aligned Point Error (FAPE) and how is it used in AF2?
An important note is that this model works iteratively. After generating a structure, the model takes all the information (MSA representation, pair representation, and predicted structure) and passes it back to the beginning of the Evoformer blocks. Check this video out of the model refining it’s predictions across blocks over multiple iteractions.
Code
#@markdown **<font color='#e59454'>Question Time! Run to Show Question**import randomfrom IPython.display import display, HTML# Define the question and answersquestion ="How does AF2 refine its predictions iteratively?"correct_answer ="Predictions are iteratively refined by taking the predicted structure, MSA representation, and pair representation from one cycle and using them as input for another round of processing through the Evoformer module. The iterative process allows the model to continuously improve its predictions."decoy_answers = ["AF2 utilizes a hierarchical neural network architecture, where different levels of the network specialize in distinct aspects of protein structure, refining predictions through a top-down feedback loop.","AF2 leverages ensemble learning, combining outputs from various independent models, and iteratively refines predictions by giving more weight to models that better align with known protein structures.","AF2 applies a genetic algorithm approach, where multiple variations of protein structures are generated and selectively combined in each iteration, akin to evolutionary selection, to improve predictions."]create_quiz(question, correct_answer, decoy_answers)
How does AF2 refine its predictions iteratively?
Onwards!
Alphafold’s architecture and concepts have inspired other deep-learning based protein folding algorithms as well, such as RosettaTTAFold2 and ESMFold. Each model has its advantages and disadvantages, dependent on its accuracy (with respect to various protein types), which oftentimes trades off with runtime (more accurate models take longer to run). Based on your desired task, whether it be designing a de novo protein or investigating a natural complex, it may be beneficial to choose a particular model to use. However, that is beyond the scope of this tutorial.
Another active field of progress is predicting folding with ligands and post-translational modifications. At the time of writing, the state of the art includes RosettaTTAFold All Atom, AF3, HelixFold3, and Chai1. These models are been able to account for ligands in folding, such as DNA or inhibitors. In particular, HelixFold-single and Chai1 look to circumvent MSA usage by using protein language models instead! Feel free to check all of them out!